Skip to content

Conversation

dljsjr
Copy link
Member

@dljsjr dljsjr commented Jul 29, 2025

Type of Change

  • Bug fix

Checklist

  • I have performed a self-review of my code
  • I have commented my code in hard-to-understand areas
  • I have verified my changes by testing with a device or have communicated a plan for testing

Description of Change

We have historically ignored non-primary devices in bonded sets during discovery/onboarding, going back to the pre-Edge days.

Bonded sets are things like stereo pairs or home theater setups.

These devices cannot be controlled via the WSS LAN API at all, and they aren't treated as being part of a Group in the Sonos group model; the intent is for API consumers to treat the entire bonded set as a single "player". Only one of the devices in the set exposes an API endpoint, which differs from a Group where all the players expose an endpoint. While groups service most commands via calling the API endpoint on the Group Coordinator, certain commands like volume can be sent to non-coordinator players. This is not the case with bonded sets; the entire set always acts atomically, and only the primary device in the set can service commands. Hence, we treat non-primary devices as not controllable in a bonded set, and we don't onboard them.

However, we didn't really have very robust handling of devices that become part of a bonded set at runtime after onboarding. This led to unnecessary CPU and RAM usage by creating a few tight loops due to unexpected failure/edge cases when making certain calls.

Because we would fail to discover this device under normal circumstances, the preferred way to handle this transition is to mark the device as being offline. We discussed emitting a delete for these devices, but it seems that it wouldn't be too uncommon for some users to create and destroy bonded sets ephemerally the way one might do with a Group. For this reason we decided we would avoid deleting the records.

If a non-primary in a bonded set is removed from the bonded set, it will be marked as online again.

Summary of Completed Tests

Tested on my personal setup by overriding the driver files directly. Will need to be regression tested by internal QA once this lands on Alpha; the OAuth stuff precludes this from being tested using the PR channel invite.

Copy link

Copy link

github-actions bot commented Jul 29, 2025

Test Results

   69 files    449 suites   0s ⏱️
2 337 tests 2 337 ✅ 0 💤 0 ❌
3 930 runs  3 930 ✅ 0 💤 0 ❌

Results for commit e6a97d2.

♻️ This comment has been updated with latest results.

Copy link

github-actions bot commented Jul 29, 2025

Minimum allowed coverage is 90%

Generated by 🐒 cobertura-action against e6a97d2

@dljsjr dljsjr force-pushed the fix/sonos-bonded-set-handling branch from e357b8f to 55bd1a7 Compare July 29, 2025 22:50
@dljsjr dljsjr mentioned this pull request Jul 30, 2025
7 tasks
@@ -12,41 +14,48 @@ CapEventHandlers.PlaybackStatus = {
Playing = "PLAYBACK_STATE_PLAYING",
}

local function _do_emit(device, attribute_event)
Copy link
Member Author

@dljsjr dljsjr Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we collapsed everything down to this helper function is because certain activity payloads that come in on a single speaker will contain information about all Players in a group/set/zone/etc., and we will dispatch to all appropriate device records that can be associated with one of the payload entries.

Emitting information about that speaker if it's known to be part of a bonded set would lead to the device record being marked as Online in certain situations, which we don't want.

@dljsjr dljsjr force-pushed the fix/sonos-bonded-set-handling branch 2 times, most recently from c5ed33e to 99b6f2b Compare July 30, 2025 15:18
---
---@field public ssdp_task SonosPersistentSsdpTask?
---@field private ssdp_event_thread_handle table?
local SonosDriver = {}

---@param device SonosDevice
function SonosDriver:update_bonded_device_tracking(device)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic in this function is very confusing to me, can you explain what is intended to be tracked here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent is to track when a device record's bonded status changes.

The device's current bonded status is tracked on the device itself via the BONDED field, which we always update as soon as possible so that we can do stuff like short circuiting emits so that we don't mark an offline device as online.

The driver shadows this information in a map that is updated independently and with less urgency, so that it can be used to look up the previous state before updating itself.

This lets us determine if a device's bonded status changes. If a device becomes bonded, we mark it offline. If a device was bonded and is no longer bonded, we re-initialize it.

"Player %s is a non-primary bonded Sonos device, ignoring",
info.discovery_info.device.name
)
-- if then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented code

@@ -358,19 +365,28 @@ function SonosConnection.new(driver, device)
local household_id, current_coordinator =
self.driver.sonos:get_coordinator_for_device(self.device)
local _, player_id = self.driver.sonos:get_player_for_device(self.device)
self.driver.sonos:update_household_info(header.householdId, body, self.device)
self.driver.sonos:update_household_info(header.householdId, body, self.driver)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change seems odd, but looking at the function definition change, it looks like the device was just an extra arg before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this looks weird on the surface.

before the OAuth refactor, this method would conditionally take a device record as a flag that it needed to perform some updates. Which was a bad heuristic, so when I refactored it, I got rid of that heuristic, but forgot to remove it from the call sites; Lua being Lua, it didn't throw any errors or cause any problems.

With the refactor for the bonded sets, we changed it so that it always takes a driver record to be able to do some tracking of bonded/unbonded state transitions, hence the updates to the call sites.

self.driver.sonos:update_device_record_from_state(header.householdId, self.device)
local _, updated_coordinator = self.driver.sonos:get_coordinator_for_device(self.device)

local bonded = self.device:get_field(PlayerFields.BONDED)
if bonded then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be combined with the if not bonded below just to simplify the logic?

Copy link
Member Author

@dljsjr dljsjr Aug 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not easily; we want to conditionally call :stop(), but then always call cleanup_unused_sockets, and we want to stop the bonded player entirely before calling that function.

The reason being is that cleanup_unused_sockets executes the cleanup logic for when the set of known Coordinators change, so that we get rid of certain namespace subscriptions and connections if a player that was a coordinator is no longer a coordinator.

This cleanup logic is applicable on any Group membership update for the topology regardless of whether or not bonded set membership shifts.

After we do the cleanup, we then want to validate our own coordinator connection (if our coordinator changed) and subscriptions only in the case that we're not part of a bonded set.

@@ -300,7 +300,7 @@ function sonos_ssdp.spawn_persistent_ssdp_task()
if info_to_send then
if not (info_to_send.discovery_info and info_to_send.discovery_info.device) then
log.error_with(
{ hub_logs = true },
{ hub_logs = false },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we don't want to include this in hub logs?

Copy link
Member Author

@dljsjr dljsjr Aug 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be chatty, and we've been asked by trim down the amount of logging that the LAN Edge Drivers forward to hub-core. It was set to true during initial development of the refactor for debugging purposes but doesn't really need to be that way anymore. With the ability to set the broker.promoted_driver_logs flag in LD to promote all logs for a driver, there isn't as much need for this in general unless it's on a path that is really pathological, IMO.

"Error handling Sonos player initialization: %s, error code: %s",
error,
(error_code or "N/A")
{ hub_logs = false },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question about not including this in hub logs

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be chatty as well, and based on logs that I've looked at from the fleet, it's almost always only hit by non-primary bonded players, so it's not providing useful information.

Copy link
Contributor

@cjswedes cjswedes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@NoahCornell
Copy link
Contributor

The changes here look good to me. I can get another speaker and do some testing so that we can merge this.

I'll work on getting a test plan that would cover this and my sonos PR so that thinktank can run through both set of changes in alpha.

@@ -23,6 +22,9 @@ function _household_mt:reset()
self.groups = utils.new_case_insensitive_table()
self.players = utils.new_case_insensitive_table()
self.player_to_group = utils.new_case_insensitive_table()
if not self.bonded_players then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe add a comment clarifying the distinction here like -- previously bonded devices should not be un-bonded after a reset since these should not be treated as distinct devices

Comment on lines +105 to +109
if not group_id or #group_id == 0 then
group_id = household.player_to_group[player_id or ""] or ""
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why even set group_id initially if we can do this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should only be case if the device is bonded and in most cases it would be better to rely on what we just got from the device instead of our cached value

Comment on lines 120 to 122
local player_tbl = household.players[player_id]
local player = (player_tbl or {}).player
local sonos_device = (player_tbl or {}).device
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
local player_tbl = household.players[player_id]
local player = (player_tbl or {}).player
local sonos_device = (player_tbl or {}).device
local player = (household.players[player_id] or {}).player
local sonos_device = (household.players[player_id] or {}).device

@@ -124,15 +131,24 @@ function SonosState:associate_device_record(device, info)
return
end

household.st_devices[player.id] = device.id
household.st_devices[sonos_device.id] = device.id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious, if these both map to the playerId, why even change this? Mostly curious if it's possible for these values to be different, and what that might mean. Should we add a check for that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this can be different. I think the player id would be the same for a stereo pair (bonded devices) but the device id would be unique for each of them

Comment on lines +258 to +261
if bonded then
device:offline()
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with this and other places, maybe it would make more sense to move this to the top and early return? could avoid issues in the future with this arbitrary-ish status.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we still want to update the device fields in this function if they have changed so I don't think we can return early

local group_id
-- non-primary bonded players are excluded from a group's list of PlayerID's so we use the group membership
-- of the primary device
if type(device.primaryDeviceId) == "string" and #device.primaryDeviceId > 0 then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
if type(device.primaryDeviceId) == "string" and #device.primaryDeviceId > 0 then
if type(device.primaryDeviceId) == "string" and device.primaryDeviceId ~= "" then

might be more clear

We have historically ignored non-primary devices in bonded sets during
discovery/onboarding, going back to the pre-Edge days. 

Bonded sets are things like stereo pairs or home theater setups.

These devices cannot be controlled via the WSS LAN API at all, and they
aren't treated as being part of a Group in the Sonos group model; the
intent is for API consumers to treat the entire bonded set as a single
"player". Only one of the devices in the set exposes an API endpoint,
which differs from a Group where all the players expose an endpoint.
While groups service most commands via calling the API endpoint on the
Group Coordinator, certain commands like volume can be sent to
non-coordinator players. This is not the case with bonded sets; the
entire set always acts atomically, and only the primary device in the
set can service commands. Hence, we treat non-primary devices as not
controllable in a bonded set, and we don't onboard them.

However, we didn't really have very robust handling of devices that
become part of a bonded set at runtime after onboarding. This led to
unnecessary CPU and RAM usage by creating a few tight loops due to
unexpected failure/edge cases when making certain calls.

Because we would fail to discover this device under normal
circumstances, the preferred way to handle this transition is to mark
the device as being offline. We discussed emitting a delete for these
devices, but it seems that it wouldn't be too uncommon for some users to
create and destroy bonded sets ephemerally the way one might do with a
Group. For this reason we decided we would avoid deleting the records.

If a non-primary in a bonded set is removed from the bonded set, it will
be marked as online again.
@NoahCornell NoahCornell force-pushed the fix/sonos-bonded-set-handling branch from 99b6f2b to 09eb03d Compare September 8, 2025 14:16
@NoahCornell NoahCornell force-pushed the fix/sonos-bonded-set-handling branch from 09eb03d to e6a97d2 Compare September 8, 2025 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants